12. Spark Project Overview

Spark Project: Sparkify


What will I learn?

You'll learn how to manipulate large, realistic datasets with Spark to engineer relevant features for predicting churn. You'll also learn how to use Spark MLlib to build machine learning models on datasets far larger than non-distributed technologies like scikit-learn can handle.

Career Relevance

Predicting churn is a challenging problem that data scientists and analysts regularly encounter in any customer-facing business. The ability to manipulate large datasets efficiently with Spark is also one of the highest-demand skills in the data field.

Essential Skills

  • Load large datasets into Spark and manipulate them using Spark SQL and Spark DataFrames
  • Use the machine learning APIs within Spark ML to build and tune models
  • Integrate the skills you've learned in the Spark course and the Data Scientist Nanodegree program

Take the Spark Course

You can find the Spark course in your Extracurriculars section here.

Project Instructions

The full dataset is 12 GB. You can analyze a mini subset of it in the workspace on the following page. Optionally, you can follow the instructions in the Extracurricular Spark course to deploy a Spark cluster in the cloud on AWS or IBM Cloud and analyze a larger amount of data. Currently, the full 12 GB dataset is available to you if you use AWS. If you use IBM Cloud, you can download a medium-sized dataset to upload to your cluster.

Details on how to do this using AWS or IBM Cloud are included in the last lesson of the Extracurricular Spark Course content linked above. Note that this part is optional, and you will not receive credits to fund your deployment. You can do the IBM Cloud portion for free. Using AWS will cost you around $30 if you keep a cluster running for a week with the settings we provide.

Once you've built your model, either in the classroom workspace or in the cloud with AWS or IBM Cloud, download your notebook and complete the remaining components of your Data Scientist Capstone project. These include thorough documentation in a README file in your GitHub repository, as well as a web app or blog post explaining the technical details of your project. Be sure to review the Project Rubric thoroughly before submitting your project.

Submission Instructions

Create a GitHub repository for this project, containing your notebook and README file. Once your project is finished, submit the URL of this repository.

Useful Links

Udacity Project FAQ
Python PEP8 Style Guide